Mitigation of Failures in High Performance Computing via Runtime Techniques
Author
Abstract
As machines increase in scale, the failure rates of supercomputers are predicted to increase correspondingly. Even though the mean time to failure (MTTF) of an individual component is high, the large number of components significantly decreases the system MTTF. Meanwhile, the decreasing size of transistors has been critical to the increase in capacity of supercomputers. As transistors shrink, silent data corruptions (SDCs) are likely to occur more frequently. SDCs do not inhibit execution, but may silently lead to incorrect results. In this thesis, we leverage runtime system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of the system-level fault tolerance strategies designed in this thesis are: reducing the extra cost added to application execution while improving system reliability; automatically adjusting fault tolerance decisions, without user intervention, in response to environmental changes; and protecting applications not only from fail-stop failures but also from silent data corruptions. The main contributions of this thesis are: a semi-blocking checkpoint protocol that overlaps application execution with fault tolerance operations to reduce the overhead of checkpointing; a runtime system technique for automatic checkpoint and restart without user intervention; a holistic framework (ACR) for automatically detecting and recovering from silent data corruptions; and a framework called FlipBack that provides targeted, low-cost protection against silent data corruption.
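The abstract's point that many reliable components still yield an unreliable system can be illustrated with a small calculation. The sketch below is not taken from the thesis: it assumes independent, exponentially distributed component failures and a system that fails as soon as any one component fails, in which case the system MTTF is the component MTTF divided by the component count. The function name `system_mttf` and the 25-year component MTTF are hypothetical values chosen only for illustration.

```python
HOURS_PER_YEAR = 365 * 24


def system_mttf(component_mttf_hours: float, num_components: int) -> float:
    """Return the system MTTF in hours.

    Assumption (not from the source): failures are independent and
    exponentially distributed, and the system fails when any single
    component fails, so the failure rates simply add up.
    """
    if num_components <= 0:
        raise ValueError("num_components must be positive")
    return component_mttf_hours / num_components


if __name__ == "__main__":
    # Hypothetical component MTTF of 25 years, for illustration only.
    component_mttf = 25 * HOURS_PER_YEAR
    for n in (1_000, 100_000, 1_000_000):
        hours = system_mttf(component_mttf, n)
        print(f"{n:>9} components -> system MTTF of about {hours:.2f} hours")
```

Under these assumptions the system MTTF shrinks linearly with the number of components, which is why checkpoint overhead becomes a central concern at scale.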
Similar resources
Data Replication-Based Scheduling in Cloud Computing Environment
High-performance computing and vast storage are two key factors required for executing data-intensive applications. Compared with traditional distributed systems such as data grids, cloud computing provides these factors on a more affordable, scalable, and elastic platform. Furthermore, accessing data files is critical for running such applications. Sometimes accessing data becomes...
Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery
Efficient utilization of today’s high-performance computing (HPC) systems with complex hardware and software components requires that HPC applications be designed to tolerate process failures at runtime. With the low mean-time-to-failure (MTTF) of current and future HPC systems, long-running simulations on these systems require capabilities for gracefully handling process failures by the applic...
Preventing Key Performance Indicators Violations Based on Proactive Runtime Adaptation in Service Oriented Environment
Key Performance Indicator (KPI) is a type of performance measurement that evaluates the success of an organization or a partial activity in which it engages. If during the running process instance the monitoring results show that the KPIs do not reach their target values, then the influential factors should be identified, and the appropriate adaptation strategies should be performed to prevent ...
ExtraTime: A Framework for Exploration of Clock and Power Gating for BTI and HCI Aging Mitigation
Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI) are two major causes of transistor aging at the nanoscale, leading to slower devices, more failures during runtime, and ultimately reduced lifetime. Typically these issues are handled by adding extra guardbands to the design, i.e., overdesign, which results in lower clock frequencies and hence performance losses. Alternatively, e...
TASA: A New Task Scheduling Algorithm in Cloud Computing
Cloud computing refers to services that run in a distributed network and are accessible through common internet protocols. It merges many physical resources and offers them to users as services according to service-level agreements. Therefore, resource management, together with task scheduling, has a direct influence on cloud networks’ performance and efficiency. Presenting a proper scheduling ...